Survivable Network Systems:
An Emerging Discipline


Abstract: Society is growing increasingly dependent upon large-scale,
highly distributed systems that operate in unbounded network
environments. Unbounded networks, such as the Internet, have no
central administrative control and no unified security policy. Furthermore,
the  number  and  nature  of  the  nodes  connected  to  such  networks  cannot 
be  fully  known.  Despite  the  best  efforts  of  security  practitioners,  no 
amount  of  system  hardening  can  assure  that  a  system  that  is  connected 
to  an  unbounded  network  will  be  invulnerable  to  attack.  The  discipline  of 
survivability  can  help  ensure  that  such  systems  can  deliver  essential 
services  and  maintain  essential  properties  such  as  integrity, 
confidentiality,  and  performance,  despite  the  presence  of  intrusions. 
Unlike  the  traditional  security  measures  that  require  central  control  or 
administration,  survivability  is  intended  to  address  unbounded  network 
environments.  This  report  describes  the  survivability  approach  to  helping 
assure  that  a  system  that  must  operate  in  an  unbounded  network  is 
robust  in  the  presence  of  attack  and  will  survive  attacks  that  result  in 
successful  intrusions.  Included  are  discussions  of  survivability  as  an 
integrated  engineering  framework,  the  current  state  of  survivability 
practice,  the  specification  of  survivability  requirements,  strategies  for 
achieving  survivability,  and  techniques  and  processes  for  analyzing 
survivability. 


1. Survivability in Network Systems

Contemporary  large-scale  networked  systems  that  are  highly  distributed  improve  the 
efficiency  and  effectiveness  of  organizations  by  permitting  whole  new  levels  of 
organizational  integration.  However,  such  integration  is  accompanied  by  elevated  risks 
of  intrusion  and  compromise.  These  risks  can  be  mitigated  by  incorporating  survivability 
capabilities  into  an  organization’s  systems.  As  an  emerging  discipline,  survivability  builds 
on  related  fields  of  study  (e.g.,  security,  fault  tolerance,  safety,  reliability,  reuse, 
performance,  verification,  and  testing)  and  introduces  new  concepts  and  principles. 
Survivability  focuses  on  preserving  essential  services  in  unbounded  environments,  even 
when  systems  in  such  environments  are  penetrated  and  compromised  [Anderson  97]. 

1.1 The New Network Paradigm: Organizational Integration


From  their  modest  beginnings  some  20  years  ago,  computer  networks  have  become  a 
critical  element  of  modern  society.  These  networks  not  only  have  global  reach,  they  also 
have an impact on virtually every aspect of human endeavor. Network systems are principal
enabling  agents  in  business,  industry,  government,  and  defense.  Major  economic 
sectors,  including  defense,  energy,  transportation,  telecommunications,  manufacturing, 
financial  services,  health  care,  and  education,  all  depend  on  a  vast  array  of  networks 
operating  on  local,  national,  and  global  scales.  This  pervasive  societal  dependency  on 
networks  magnifies  the  consequences  of  intrusions,  accidents,  and  failures,  and 
amplifies  the  critical  importance  of  ensuring  network  survivability. 

As  organizations  seek  to  improve  efficiency  and  competitiveness,  a  new  network 
paradigm  is  emerging.  Networks  are  being  used  to  achieve  radical  new  levels  of 
organizational integration. This integration obliterates traditional organizational
boundaries  and  transforms  local  operations  into  components  of  comprehensive, 
network-resident  business  processes.  For  example,  commercial  organizations  are 
integrating  operations  with  business  units,  suppliers,  and  customers  through  large-scale 
networks  that  enhance  communication  and  services.  These  networks  combine 
previously  fragmented  operations  into  coherent  processes  open  to  many  organizational 
participants.  This  new  paradigm  represents  a  shift  from  bounded  networks  with  central 
control  to  unbounded  networks.  Unbounded  networks  are  characterized  by  distributed 
administrative  control  without  central  authority,  limited  visibility  beyond  the  boundaries  of 
local  administration,  and  lack  of  complete  information  about  the  network.  At  the  same 
time,  organizational  dependencies  on  networks  are  increasing  and  risks  and 
consequences  of  intrusions  and  compromises  are  amplified. 

1.2 The Definition of Survivability

We  define  survivability  as  the  capability  of  a  system  to  fulfill  its  mission,  in  a  timely 
manner,  in  the  presence  of  attacks,  failures,  or  accidents.  We  use  the  term  system  in  the 
broadest  possible  sense,  including  networks  and  large-scale  systems  of  systems. 

The  term  mission  refers  to  a  set  of  very  high-level  (i.e.,  abstract)  requirements  or  goals. 
Missions  are  not  limited  to  military  settings  since  any  successful  organization  or  project 
must  have  a  vision  of  its  objectives  whether  expressed  implicitly  or  as  a  formal  mission 
statement.  Judgments  as  to  whether  or  not  a  mission  has  been  successfully  fulfilled  are 
typically  made  in  the  context  of  external  conditions  that  may  affect  the  achievement  of 
that  mission.  For  example,  assume  that  a  financial  system  shuts  down  for  12  hours 
during  a  period  of  widespread  power  outages  caused  by  a  hurricane.  If  the  system 
preserves  the  integrity  and  confidentiality  of  its  data  and  resumes  its  essential  services 
after the period of environmental stress is over, the system can reasonably be judged to
have  fulfilled  its  mission.  However,  if  the  same  system  shuts  down  unexpectedly  for  12 
hours  under  normal  conditions  (or  under  relatively  minor  environmental  stress)  and 
deprives  its  users  of  essential  financial  services,  the  system  can  reasonably  be  judged  to 
have  failed  its  mission,  even  if  data  integrity  and  confidentiality  are  preserved. 

Timeliness is a critical factor that is typically included in (or implied by) the very
high-level requirements that define a mission. However, timeliness is such an important factor
that  we  included  it  explicitly  in  the  definition  of  survivability. 

The  terms  attack,  failure,  and  accident  are  meant  to  include  all  potentially  damaging 
events;  but  these  terms  do  not  partition  these  events  into  mutually  exclusive  or  even 
distinguishable  sets.  It  is  often  difficult  to  determine  if  a  particular  detrimental  event  is  the 
result  of  a  malicious  attack,  a  failure  of  a  component,  or  an  accident.  Even  if  the  cause  is 
eventually  determined,  the  critical  immediate  response  cannot  depend  on  such 
speculative  future  knowledge. 

Attacks  are  potentially  damaging  events  orchestrated  by  an  intelligent  adversary.  Attacks 
include  intrusions,  probes,  and  denial  of  service.  Moreover,  the  threat  of  an  attack  may 
have  as  severe  an  impact  on  a  system  as  an  actual  occurrence.  A  system  that  assumes 
a  defensive  position  because  of  the  threat  of  an  attack  may  reduce  its  functionality  and 
divert additional resources to monitoring the environment and protecting system assets.

We  include  failures  and  accidents  as  part  of  survivability.  Failures  are  potentially 
damaging  events  caused  by  deficiencies  in  the  system  or  in  an  external  element  on 
which  the  system  depends.  Failures  may  be  due  to  software  design  errors,  hardware 
degradation,  human  errors,  or  corrupted  data.  Accidents  describe  the  broad  range  of 
randomly  occurring  and  potentially  damaging  events  such  as  natural  disasters.  We  tend 
to  think  of  accidents  as  externally  generated  events  (i.e.,  outside  the  system)  and 
failures  as  internally  generated  events. 

With  respect  to  system  survivability,  a  distinction  between  a  failure  and  an  accident  is 
less  important  than  the  impact  of  the  event.  Nor  is  it  often  possible  to  distinguish 
between  intelligently  orchestrated  attacks  and  unintentional  or  randomly  occurring 
detrimental  events.  Our  approach  concentrates  on  the  effect  of  a  potentially  damaging 
event.  Typically,  for  a  system  to  survive,  it  must  react  to  (and  recover  from)  a  damaging 
effect  (e.g.,  the  integrity  of  a  database  is  compromised)  long  before  the  underlying  cause 
is  identified.  In  fact,  the  reaction  and  recovery  must  be  successful  whether  or  not  the 
cause  is  ever  determined. 
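
The effect-driven approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Event type, the effect names, and the action names are hypothetical and do not come from this report. The point it shows is that the response is selected from the observed effect alone, so it is the same whether the cause is an attack, a failure, an accident, or never determined.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    """An observed damaging effect; the cause may never be known."""
    effect: str                   # hypothetical name, e.g. "db_integrity_violation"
    cause: Optional[str] = None   # "attack", "failure", "accident", or None


# Hypothetical response table keyed by effect only (an assumption of this sketch).
RESPONSES = {
    "db_integrity_violation": ["isolate_database", "restore_from_replica"],
    "service_unresponsive": ["fail_over_to_backup"],
}


def respond(event: Event) -> List[str]:
    """Choose recovery actions from the observed effect alone."""
    # The cause field is deliberately ignored: reaction and recovery
    # must not wait for the cause to be identified.
    return RESPONSES.get(event.effect, ["enter_degraded_mode"])
```

Calling respond() with the same effect and different (or missing) causes yields the same actions, and an unrecognized effect falls back to a degraded mode rather than to no response at all.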

Our primary focus in this report is to help systems survive the acts of intelligent
adversaries. This bias is based on the nature of the organization to which the authors
belong. Our Survivable Network Technology Team is an outgrowth of the CERT
Coordination Center, which has been helping users respond to and recover from
computer security incidents since 1988.

Finally,  it  is  important  to  recognize  that  it  is  the  mission  fulfillment  that  must  survive,  not 
any  particular  subsystem  or  system  component.  Central  to  the  notion  of  survivability  is 
the  capability  of  a  system  to  fulfill  its  mission,  even  if  significant  portions  of  the  system 
are damaged or destroyed. We will sometimes use the term survivable system as a less
than  perfectly  precise  shorthand  for  a  system  with  the  capability  to  fulfill  a  specified 
mission  in  the  face  of  attacks,  failures,  or  accidents.  Again,  it  is  the  mission,  not  a 
particular  portion  of  the  system,  that  must  survive. 

1.3  The  Domain  of  Survivability:  Unbounded  Networks 

The  success  of  a  survivable  system  depends  on  the  computing  environment  in  which  the 
survivable  system  operates.  The  trend  in  networked  computing  environments  is  toward 
largely  unbounded  network  infrastructures.  A  bounded  system  is  one  in  which  all  of  the 
system’s  parts  are  controlled  by  a  unified  administration  and  can  be  completely 
characterized  and  controlled.  At  least  in  theory,  the  behavior  of  a  bounded  system  can 
be  understood  and  all  of  its  various  parts  identified.  In  an  unbounded  system  there  is  no 
unified administrative control over its parts. We use the term administrative control in the
strictest  sense,  which  includes  the  power  to  impose  and  enforce  sanctions  and  not 
simply  to  recommend  an  appropriate  security  policy.  In  an  unbounded  system,  each 
participant  has  an  incomplete  view  of  the  whole,  must  depend  on  and  trust  information 
supplied  by  its  neighbors,  and  cannot  exercise  control  outside  its  local  domain. 

An unbounded system can be composed of bounded and unbounded systems
connected  together  in  a  network.  Figure  1  illustrates  an  unbounded  domain  consisting  of 
a  collection  of  bounded  systems  in  which  each  bounded  system  is  under  separate 
administrative  control.  Although  the  security  policy  of  an  individual  bounded  system 
cannot  be  fully  enforced  outside  of  the  boundaries  of  its  administrative  control,  the  policy 
can  be  used  as  a  yardstick  to  evaluate  the  security  state  of  that  bounded  system.  Of 
course,  the  security  policy  can  be  advertised  outside  of  the  bounded  system;  but 
administrators  are  severely  limited  in  their  ability  to  compel  or  persuade  outside 
individuals  or  entities  to  follow  it.  This  limitation  is  particularly  true  when  an  unbounded 
domain  spans  jurisdictional  boundaries,  making  legal  sanctions  difficult  or  impossible  to 
impose. 


Figure 1: An Unbounded Domain Viewed as a Collection of Bounded Systems



CMU/SEI-97-TR-013 


When  an  application  or  software-intensive  system  is  exposed  to  an  environment 
consisting  of  multiple,  unpredictable  administrative  domains  with  no  measurable  bounds, 
the  system  has  an  unbounded  environment.  An  unbounded  environment  exhibits  the 
following  properties: 

•  multiple  administrative  domains  with  no  central  authority 

•  an  absence  of  global  visibility  (i.e.,  the  number  and  nature  of  the  nodes  in  the 
network  cannot  be  fully  known) 

•  interoperability  between  administrative  domains  determined  by  convention 

•  widely  distributed  and  interoperable  systems 

•  users and attackers that can be peers in the environment

•  no possible partition into a finite number of bounded environments

The  Internet  is  an  example  of  an  unbounded  environment  with  many  client-server 
network  applications.  A  public  Web  server  and  its  clients  may  exist  within  many  different 
administrative  domains  on  the  Internet;  yet  there  exists  no  central  authority  that  requires 
all  clients  to  be  configured  in  a  way  expected  by  the  Web  server.  In  particular,  a  Web 
server  can  never  rely  on  a  set  of  client  plug-ins  to  be  present  or  absent  for  any  function 
that  the  server  may  want  to  provide. 

For  a  Web  server  providing  a  financial  transaction  (e.g.,  for  a  Web-based  purchase),  the 
Web  server  may  require  that  the  user  install  a  plug-in  on  the  client  to  support  a  secure 
transaction.  However,  due  to  the  unbounded  nature  of  the  environment,  previously 
installed  plug-ins  from  a  competitor  may  be  present  on  the  client  that  may  corrupt, 
subvert,  or  damage  the  Web  server  during  the  transaction.  For  the  Web  server  to  be 
survivable,  there  must  be  built-in  protection  from  malicious  client  interactions  and  these 
protections  must  make  no  assumptions  about  the  configuration  or  features  of  the  remote 
client. 
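
A minimal sketch of such a built-in protection follows, assuming a purchase request arrives at the server as a dictionary of form fields. The function name validate_purchase, the field name "amount", and the MAX_AMOUNT limit are hypothetical choices made for illustration. The server validates everything itself and assumes nothing about plug-ins or checks that may or may not have run on the client.

```python
from decimal import Decimal, InvalidOperation

MAX_AMOUNT = Decimal("10000.00")  # hypothetical per-transaction limit


def validate_purchase(form: dict) -> Decimal:
    """Validate an untrusted purchase request entirely on the server."""
    raw = form.get("amount")
    # Accept only a short string; any other type or length is rejected.
    if not isinstance(raw, str) or len(raw) > 16:
        raise ValueError("amount missing or malformed")
    try:
        amount = Decimal(raw)
    except InvalidOperation:
        raise ValueError("amount is not a number")
    # Reject NaN and infinities before making any range comparison.
    if not amount.is_finite() or amount <= 0 or amount > MAX_AMOUNT:
        raise ValueError("amount out of range")
    return amount.quantize(Decimal("0.01"))
```

Because every check runs on the server, a client with a missing, outdated, or hostile plug-in is treated exactly like any other client: its request is either well formed or rejected.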

In  this  example,  the  Web  server  and  its  clients  make  up  the  system.  The  multiple 
administrative  domains  are  the  variety  of  site  domains  on  the  Internet.  Many  of  these 
domains  have  legitimate  users.  Other  sites  are  used  for  intrusions  in  an  anonymous 
setting.  These  latter  sites  cannot  be  distinguished  by  their  administrative  domain,  but 
only  by  client  behavior.  The  interoperability  between  the  server  and  its  clients  is  defined 
by  http  (hypertext  transfer  protocol),  a  convention  agreed  upon  between  the  server  and 
clients.  The  system,  composed  of  Web  servers  and  clients,  is  widely  distributed  both 
geographically  and  logically  throughout  the  Internet.  Legitimate  users  and  attackers  are 
peers  in  the  environment  and  there  is  no  method  to  isolate  legitimate  users  from  the 
attackers.  In  other  words,  there  is  no  way  to  bound  the  environment  to  legitimate  users 
using  only  a  common  administrative  policy. 

Unbounded systems are a significant component of today’s computing environment and
will play an even larger role in the future. The Internet (a non-hierarchical network of
systems, each under local administrative control only) is a primary example of an
unbounded  system.  While  conventions  exist  that  allow  the  parts  of  the  Internet  to  work 
together,  there  is  no  global  administrative  control  to  assure  that  these  parts  behave 
according  to  these  conventions.  Therefore,  security  problems  abound.  Unfortunately,  the 
security  problems  associated  with  unbounded  systems  are  typically  underestimated. 

1.4 Characteristics of Survivable Systems


A  key  characteristic  of  survivable  systems  is  their  capability  to  deliver  essential  services 
in  the  face  of  attack,  failure,  or  accident. 

Central  to  the  delivery  of  essential  services  is  the  capability  of  a  system  to  maintain 
essential  properties  (i.e.,  specified  levels  of  integrity,  confidentiality,  performance,  and 
other  quality  attributes)  in  the  presence  of  attack,  failure,  or  accident.  Thus,  it  is 
important  to  define  minimum  levels  of  quality  attributes  that  must  be  associated  with 
essential  services.  For  example,  a  launch  of  a  missile  by  a  defensive  system  is  no  longer 
effective  if  the  system  performance  is  slowed  to  the  point  that  the  target  is  out  of  range 
before  the  system  can  launch. 

These  quality  attributes  are  so  important  that  definitions  of  survivability  are  often 
expressed  in  terms  of  maintaining  a  balance  among  multiple  quality  attributes  such  as 
performance,  security,  reliability,  availability,  fault-tolerance,  modifiability,  and 
affordability.  The  Architecture  Tradeoff  Analysis  project  at  the  Software  Engineering 
Institute  is  using  this  attribute-balancing  (i.e.,  tradeoff)  view  of  survivability  to  evaluate 
and  synthesize  survivable  systems  [Kazman  98].  Quality  attributes  represent  broad 
categories  of  related  requirements,  so  a  quality  attribute  may  contain  other  quality 
attributes.  For  example,  the  security  attribute  traditionally  includes  the  three  attributes: 
confidentiality,  integrity,  and  availability. 

The  capability  to  deliver  essential  services  (and  maintain  the  associated  essential 
properties)  must  be  sustained  even  if  a  significant  portion  of  the  system  is  incapacitated. 
Furthermore,  this  capability  should  not  be  dependent  upon  the  survival  of  a  specific 
information  resource,  computation,  or  communication  link.  In  a  military  setting,  essential 
services  might  be  those  required  to  maintain  an  overwhelming  technical  superiority,  and 
essential properties may include integrity, confidentiality, and a level of performance
sufficient  to  deliver  results  in  less  than  one  decision  cycle  of  the  enemy.  In  the  public 
sector,  a  survivable  financial  system  is  one  that  maintains  the  integrity,  confidentiality, 
and  availability  of  essential  information  and  financial  services,  even  if  particular  nodes  or 
communication  links  are  incapacitated  through  intrusion  or  accident,  and  that  recovers 
compromised  information  and  services  in  a  timely  manner.  The  financial  system’s 
survivability might be judged by using a composite measure of the disruption of stock
trades  or  bank  transactions  (i.e.,  a  measure  of  the  disruption  of  essential  services). 

Key  to  the  concept  of  survivability,  then,  is  identifying  the  essential  services  (and  the 
essential  properties  that  support  them)  within  an  operational  system.  Essential  services 
are  defined  as  the  functions  of  the  system  that  must  be  maintained  when  the 
environment  is  hostile  or  failures  or  accidents  are  detected  that  threaten  the  system. 
There  are  typically  many  services  that  can  be  temporarily  suspended  when  a  system  is 
dealing  with  an  attack  or  other  extraordinary  environmental  condition.  Such  a 
suspension  can  help  isolate  areas  affected  by  an  intrusion  and  free  system  resources  to 
deal with its effects. The overall function of a system should adapt to preserve essential
services.
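
The idea of suspending non-essential services to preserve essential ones can be sketched as follows. The Service type and the service names used in the usage note are hypothetical. The sketch assumes the essential services have already been identified, which the text above treats as the key step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Service:
    name: str
    essential: bool       # identified in advance as an essential service
    running: bool = True


def enter_defensive_mode(services: List[Service]) -> List[str]:
    """Suspend running non-essential services; return the names suspended."""
    suspended = []
    for svc in services:
        if not svc.essential and svc.running:
            svc.running = False   # frees resources and narrows exposure
            suspended.append(svc.name)
    return suspended
```

For example, with a hypothetical essential "funds_transfer" service and a non-essential "marketing_reports" service, entering defensive mode suspends only the latter, and calling the function a second time suspends nothing further.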

We  have  linked  the  capability  of  a  survivable  system  to  fulfill  its  mission  in  a  timely 
manner  to  its  ability  to  deliver  essential  services  in  the  presence  of  attack,  accident,  or 
failure.  Ultimately,  mission  fulfillment  must  survive,  not  any  portion  or  component  of  the 
system.  If  an  essential  service  is  lost,  it  can  be  replaced  by  another  service  that  supports 
mission  fulfillment  in  a  different  but  equivalent  way.  However,  we  still  believe  that  the 
identification  and  protection  of  essential  services  is  an  important  part  of  a  practical 
approach  to  building  and  analyzing  survivable  systems.  As  a  result,  we  define  essential 
services  to  include  alternate  sets  of  essential  services  (perhaps  mutually  exclusive)  that 
need  not  be  simultaneously  available.  For  example,  a  set  of  essential  services  to  support 
power  delivery  may  include  both  the  distribution  of  electricity  and  the  operation  of  a 
natural  gas  pipeline. 

To maintain their capabilities to deliver essential services, survivable systems must
exhibit the four key properties illustrated in Table 1:


Key Property                    Description                        Example
------------------------------  ---------------------------------  --------------------------------
Resistance to attacks           strategies for repelling attacks   user authentication;
                                                                   stochastic diversity of programs

Recognition of attacks and      strategies for detecting attacks   recognition of intrusion usage
the extent of damage            (including intrusions) and         patterns;
                                understanding the current state    internal integrity checking
                                of the system, including
                                evaluating the extent of damage

Recovery of full and essential  strategies for restoring           replication and reinitialization
services after attack           compromised information or         of data
                                functionality, limiting the
                                extent of damage, maintaining or,
                                if necessary, restoring essential
                                services within the time
                                constraints of the mission,
                                restoring full service as
                                conditions permit

Adaptation and evolution to     strategies for improving system    incorporation of new patterns
reduce effectiveness of future  survivability based on knowledge   for intrusion recognition
attacks                         gained from intrusions

Table 1: The Key Properties of Survivable Systems
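
Two of the examples in Table 1, internal integrity checking (recognition) and replication of data (recovery), can be combined in a short sketch. It is an illustration only: the function names are hypothetical, and the choice of SHA-256 from Python's hashlib module is an assumption of the sketch rather than a technique named in this report.

```python
import hashlib


def digest(data: bytes) -> str:
    """Fingerprint used for internal integrity checking."""
    return hashlib.sha256(data).hexdigest()


def check_and_recover(primary: bytes, replica: bytes, known_good: str) -> bytes:
    """Return a copy that matches the known-good digest, or raise."""
    # Recognition: does the primary copy still match its recorded digest?
    if digest(primary) == known_good:
        return primary
    # Recovery: fall back to the replicated copy if it is intact.
    if digest(replica) == known_good:
        return replica
    raise RuntimeError("no intact copy available")
```

The sketch assumes the known-good digest is itself stored where an intruder cannot alter it. If that assumption fails, the check recognizes nothing.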

1.5 Survivability as an Integrated Engineering Framework


As  a  broadly-based  engineering  paradigm,  survivability  is  a  natural  framework  for 
integrating  established  and  emerging  software  engineering  disciplines  in  the  service  of  a 
common  goal.  These  established  areas  of  software  engineering,  which  are  related  to 
survivability,  include  security,  fault  tolerance,  safety,  reliability,  reuse,  performance, 
verification,  and  testing.  Research  in  survivability  encompasses  a  wide  variety  of 
research  methods,  including  the  investigation  of 

•  analogues  to  the  immunological  functioning  of  an  individual  organism 

•  sociological  analogues  to  public  health  efforts  at  the  community  level 

1.5.1  Survivability  and  Security 

The  discipline  of  computer  security  has  made  valuable  contributions  to  the  protection 
and  integrity  of  information  systems  over  the  past  three  decades.  However,  computer 
security  has  traditionally  been  used  as  a  binary  term  that  suggests  that  at  any  moment  in 
time  a  system  is  either  safe  or  compromised.  We  believe  that  this  use  of  computer 
security  engenders  viewpoints  that  largely  ignore  the  aspects  of  recovery  from  the 
compromise  of  a  system  and  aspects  of  maintaining  services  during  and  after  an 
intrusion.  Such  an  approach  is  inadequate  to  support  necessary  improvements  in  the 
state  of  the  practice  of  protecting  computer  systems  from  attack.  In  contrast,  the  term 
survivable  systems  refers  to  systems  whose  components  collectively  accomplish  their 
mission  even  under  attack  and  despite  active  intrusions  that  effectively  damage  a 
significant  portion  of  the  system. 

Robustness  under  attack  is  at  least  as  important  as  hardness  or  resistance  to  attack. 
Hardness  contributes  to  survivability,  but  robustness  under  attack  (and,  in  particular, 
recoverability)  is  the  essential  characteristic  that  distinguishes  survivability  from 
traditional  computer  security.  At  the  same  time,  survivability  can  benefit  from  computer 
security  research  and  practice,  and  survivability  can  provide  a  framework  for  integrating 
security  with  other  disciplines  that  can  contribute  to  system  survivability. 

1.5.2 Survivability and Fault Tolerance

Survivability  requires  robustness  under  conditions  of  intrusion,  failure,  or  accident.  The 
concept  of  survivability  includes  fault  tolerance,  but  is  not  equivalent  to  it.  Fault  tolerance 
relates  to  the  statistical  probability  of  an  accidental  fault  or  combination  of  faults,  not  to 
malicious  attack.  For  example,  an  analysis  of  a  system  may  determine  that  the 
simultaneous occurrence of the three statistically independent faults (f1, f2, and f3) will
cause  the  system  to  fail.  The  probability  of  the  three  independent  faults  occurring 
simultaneously  by  accident  may  be  extremely  small,  but  an  intelligent  adversary  with 
knowledge of the system’s internals can orchestrate the simultaneous occurrence of
these  three  faults  and  bring  down  the  system.  A  fault-tolerant  system  most  likely  does 
not  address  the  possibility  of  the  three  faults  occurring  simultaneously,  if  the  probability 
of  occurrence  is  below  a  threshold  of  concern.  A  survivable  system  requires  a 
contingency  plan  to  deal  with  such  a  possibility. 
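
The arithmetic behind this argument can be made concrete. For statistically independent faults, the probability that all occur together is the product of their individual probabilities. The figures below (one chance in a thousand per fault) are illustrative assumptions and not values from this report.

```python
import math


def p_all_faults(probabilities):
    """Probability that all independent faults occur together."""
    return math.prod(probabilities)


# Accidental case: each of f1, f2, f3 is assumed to occur with probability 0.001.
accidental = p_all_faults([1e-3, 1e-3, 1e-3])    # about one in a billion

# Orchestrated case: an adversary who can trigger each fault at will
# makes each probability effectively 1, so independence no longer helps.
orchestrated = p_all_faults([1.0, 1.0, 1.0])
```

A fault-tolerance analysis that dismisses the accidental figure as below a threshold of concern says nothing about the orchestrated case, which is why a survivable system needs a contingency plan for it.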

Redundancy  is  another  factor  that  can  contribute  to  the  survivability  of  systems. 
However,  redundancy  alone  is  insufficient  since  multiple  identical  backup  systems  share 
identical  vulnerabilities.  A  survivable  system  requires  each  backup  system  to  offer 
equivalent  functionality,  but  significant  variance  in  implementation.  This  variance  thwarts 
attempts  to  compromise  the  primary  system  and  all  backup  systems  with  a  single  attack 
strategy. 
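
The difference between identical and diverse backups can be shown with a toy model. Each replica is represented by the set of flaws present in its implementation. The replica names and flaw labels are hypothetical placeholders.

```python
def compromised(replicas, exploit):
    """Return the names of replicas whose implementation has the exploited flaw."""
    return [name for name, flaws in replicas.items() if exploit in flaws]


# Identical backups share the primary's implementation and therefore its flaws.
identical = {"primary": {"flaw_a"}, "backup1": {"flaw_a"}, "backup2": {"flaw_a"}}

# Diverse backups offer equivalent functionality through different implementations.
diverse = {"primary": {"flaw_a"}, "backup1": {"flaw_b"}, "backup2": {"flaw_c"}}
```

In this model a single exploit of "flaw_a" takes down all three identical replicas but only the primary of the diverse set. The model assumes the diverse implementations share no flaws, which real systems can only approximate.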

1.6 The Current State of Practice in Survivable Systems


Much  of  today’s  research  and  practice  in  computer-systems  survivability  takes  a 
perilously  narrow,  security-based  view  of  defense  against  computer  intrusions.  This 
narrow  view  is  dangerously  incomplete  because  it  focuses  almost  exclusively  on 
hardening  a  system  (e.g.,  using  firewall  technology  or  an  orange  book  approach  to  host 
protection) to prevent a break-in or other malicious attack. This view says little about
how  to  detect  an  intrusion  or  what  to  do  once  an  intrusion  has  occurred  or  is  under  way. 
This  view  is  also  accompanied  by  evaluation  techniques  that  limit  their  focus  to  the 
relative  hardness  of  a  system,  as  opposed  to  a  system’s  robustness  under  attack  and 
ability  to  recover  compromised  capabilities. 

The  architecture  of  secure  bounded  systems  is  built  upon  the  existence  of  a  security 
policy  and  its  enforcement,  which  is  imposed  by  the  exercise  of  administrative  control.  In 
contrast,  an  unbounded  system  has  no  administrative  control  with  which  to  impose 
global-security  policy.  For  instance,  on  the  Internet  today  the  backbone  architecture 
exists  independent  of  security  policy  considerations  because  there  is  no  global 
administrative  control. 

Affordability  is  always  a  significant  factor  in  the  design,  implementation,  and 
maintenance  of  systems  that  support  the  national  infrastructure  (e.g.,  the  power  grid,  the 
public  switched  communications  networks,  and  the  financial  networks)  and  our  national 
defense.  In  fact,  the  trend  toward  increased  sharing  of  common  infrastructure 
components  in  the  interest  of  economy  virtually  ensures  that  the  civilian  networked 
information infrastructure, and its vulnerabilities, will always be an inseparable part of our
national  defense. 

Practical,  affordable  systems  are  almost  never  100%  customized,  but  rather  are 
constructed  from  commonly  available  off-the-shelf  components  with  internal  structures 
that  are  well  known.  The  trend  toward  developing  systems  through  integration  and  reuse 
instead of customized design and coding efforts is a cornerstone of modern software
engineering.  Unfortunately,  the  intellectual  complexity  associated  with  software  design, 
coding,  and  testing  virtually  ensures  that  exploitable  bugs  can  and  will  be  discovered  in 
commercial  and  public  domain  products  with  internal  structures  that  are  available  for 
analysis.  When  these  products  are  incorporated  as  components  of  larger  systems,  those 
systems  become  vulnerable  to  attack  strategies  based  on  the  exploitable  bugs.  Popular 
commercial  and  public-domain  components  offer  attackers  a  ubiquitous  set  of  targets 
with  well-known  and  typically  unvarying  internal  structures.  This  lack  of  variability  among 
components  translates  into  a  lack  of  variability  among  systems.  These  systems 
potentially  allow  a  single  attack  strategy  to  have  a  wide-ranging  and  devastating  impact. 

The  natural  escalation  of  offensive  threats  versus  defensive  countermeasures  has 
demonstrated  time  and  again  that  no  practical  systems  can  be  built  that  are  invulnerable 
to  attack.  Despite  best  efforts,  there  can  be  no  assurance  that  systems  will  not  be 
breached.  Thus,  the  traditional  view  of  information  systems  security  must  be  expanded 
to  encompass  the  specification  and  design  of  system  behavior  that  helps  the  system 
survive  in  spite  of  active  intrusions.  Only  then  can  systems  be  created  that  are  robust  in 
the  presence  of  attack  and  are  able  to  survive  attacks  that  cannot  be  completely 
repelled. 

In  short,  the  nature  of  contemporary  system  development  dictates  that  even  hardened 
systems  can  and  will  be  broken.  Therefore,  survivability  must  be  designed  into  systems 
to  help  avoid  the  potentially  devastating  effects  of  system  compromise  and  failure  due  to 
intrusion. 

1.6.1  Incident  Handling  Has  Enhanced  Survivability 

Although  applying  the  term  survivability  to  computer  systems  is  relatively  new,  the 
practice  of  survivability  is  not.  Much  of  the  survivability  practice  to  date  has  been  in  the 
realm  of  incident  response  (IR)  teams.  In  fact,  the  CERT®  Coordination  Center 
(CERT/CC) has, throughout its history, enhanced system survivability in the Internet
community.  The  CERT/CC  provides  incident  response  services  (helping  organizations 
respond  to  and  recover  from  incidents)  and  publishes  and  distributes  vulnerability 
advisories  (akin  to  public  health  notices).  Traditionally,  the  CERT/CC  has  been 
concerned  about  survivability  and  has  been  successful  in  helping  sites  with  risk 
mitigation  and  recovery. 

The  experience  of  the  CERT  Coordination  Center  has  shown  that  how  organizations 
respond  to  and  recover  from  computer  intrusions  is  at  least  as  important  as  the  steps 
they  take  to  prevent  them.  We  believe  that  widespread  availability  and  use  of  survivable 
systems by the Internet community and throughout the Internet infrastructure will provide
the  best  hope  for  the  dramatic  improvements  necessary  to  transform  the  Internet  into  a 
survivable,  networked  information  system  of  systems.  Survivable  systems  will  help  make 
the  Internet  a  viable  medium  for  the  conduct  of  commerce,  defense,  and  government. 
This  medium  will  also  enable  the  support  of  major  elements  of  the  national  infrastructure 
(e.g., power grid, public switched network, and air traffic control).

1.6.2 Firewalls Embody the Current State of Practice

Currently,  little  of  the  basic  technology  in  security  engineering  and  system  integration 
applies  to  unbounded  systems.  Instead,  current  practice  assumes  that  the  capability 
exists  to  identify,  define,  and  characterize  the  extent  of  administrative  control  over  a 
system,  all  access  points  to  that  system,  and  all  signals  that  may  appear  at  those  access 
points.  In  unbounded  systems,  such  as  the  current  Internet  and  the  future  National 
Information  Infrastructure,  these  boundary  conditions  cannot  be  fully  determined. 

The  current  state  of  practice  in  survivability  and  security  evaluation  tends  to  treat 
systems  and  their  environments  as  static  and  unchanging.  However,  the  survivability 
and security of systems in fact degrade over time as changes occur in their structures,
configurations,  and  environments,  and  as  knowledge  of  their  vulnerabilities  spreads 
throughout  the  intruder  community. 

On  the  Internet  today,  the  cornerstone  of  security  is  the  notion  of  a  firewall,  a  logically 
bounded  system  within  a  physically  unbounded  one.  We  assert  that  bounded-system 
thinking within unbounded domains leads to security designs and architectures that are
fundamentally  flawed  from  a  survivability  perspective.  One  notable  example  is  the  use  of 
a  firewall  as  the  basic  security  component  of  the  Internet.  This  approach  is  severely 
limited  and  can  be  readily  circumvented  by  exploiting  the  fundamental  differences 
between  bounded  and  unbounded  systems.  Traditional  firewalls  are  the  state  of  the  art 
for  security  architectures,  but  not  for  survivable  systems,  because  they  are  passive, 
filter-only  devices.  The  addition  of  active  components,  such  as  detection  and  a  dynamic- 
response  capability,  will  allow  firewalls  to  play  a  role  in  survivable  systems. 

2. Defining Requirements for Survivable Systems


Survivability  requirements  can  vary  substantially  depending  on  system  scope,  criticality, 
and  the  consequences  of  failure  and  interruption  of  service.  Categories  of  requirements 
definitions  for  survivable  systems  include  function,  usage,  development,  operation,  and 
evolution.  In  this  section,  we  present  definitions  of  survivability  requirements,  ways  in 
which  these  requirements  can  be  expressed,  and  their  impact  on  system  survivability. 

The  new  paradigm  for  system  requirements  definition  and  design  is  characterized  by 
distributed  services,  distributed  logic,  distributed  code  (including  executable  content), 
distributed  hardware,  a  shared  communications  and  routing  infrastructure,  diminished 
trust,  and  a  lack  of  unified  administrative  control.  Assuring  the  survivability  of  mission- 
critical  systems  developed  under  this  new  paradigm  is  a  formidable  high-stakes  effort  for 
software  engineering  research.  This  effort  requires  that  traditional  computer  security 
measures  be  augmented  by  new  and  comprehensive  system  survivability  strategies. 

2.1  Expressing  Survivability  Requirements 

The  definition  and  analysis  of  survivability  requirements  is  a  critical  first  step  in  achieving 
system  survivability  [Linger  98].  Figure  2  depicts  an  iterative  model  for  defining  these 
requirements.  Survivability  must  address  not  only  requirements  for  software  functionality, 
but  also  requirements  for  software  usage,  development,  operation,  and  evolution.  Thus, 
five  types  of  requirements  definitions  are  relevant  to  survivable  systems  in  the  model. 
These  requirements  are  discussed  in  detail  in  the  following  subsections. 

Figure 2: Requirements Definition for Survivable Systems


System/Survivability  Requirements:  The  term  system  requirements  refers  to 
traditional  user  functions  that  a  system  must  provide.  For  example,  a  network 
management  system  must  provide  functions  to  enable  users  to  monitor  network 
operations, adjust performance parameters, etc. System requirements also include non-functional
aspects of a system, such as timing, performance, and reliability. The term
survivability  requirements  refers  to  the  capabilities  of  a  system  to  deliver  essential 
services  in  the  presence  of  intrusions  and  compromises  and  to  recover  full  services. 
Figure 3 depicts the integration of survivability requirements with system requirements at
node  and  network  levels. 



Figure  3:  Integrating  Survivability  Requirements  with  System  Requirements 

Survivability  requires  that  system  requirements  be  organized  into  essential  services  and 
non-essential  services.  Essential  services  must  be  maintained  even  during  successful 
intrusions;  non-essential  services  are  recovered  after  intrusions  have  been  handled. 
Essential  services  may  be  stratified  into  any  number  of  levels,  each  embodying  fewer 
and more vital services as the severity and duration of intrusion increase. Thus,
definitions  of  requirements  for  essential  services  must  be  augmented  with  appropriate 
survivability  requirements. 
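
As a minimal sketch of this stratification, the levels can be written as nested sets of services, each level a strict subset of the one before it. The service names and the three-level scale below are hypothetical and are not taken from any particular system.

```python
# Hypothetical service levels: level 0 is normal operation; each higher
# level keeps fewer, more vital services. All names are illustrative.
SERVICE_LEVELS = [
    {"routing", "monitoring", "tuning", "reporting"},  # level 0: full service
    {"routing", "monitoring"},                         # level 1: essential
    {"routing"},                                       # level 2: most vital
]


def validate_levels(levels):
    """Check that every level is a non-empty strict subset of the previous one."""
    if not levels or not levels[0]:
        return False
    return all(cur and cur < prev for prev, cur in zip(levels, levels[1:]))


def services_for(severity, levels=SERVICE_LEVELS):
    """Return the services to maintain at a given intrusion severity."""
    index = min(max(severity, 0), len(levels) - 1)
    return levels[index]
```

Any severity beyond the last defined level falls back to the most vital set, so the requirement always names some set of services to preserve.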

As  shown  in  Figure  2,  survivable  systems  may  also  include  legacy  and  acquired  COTS 
components  that  were  not  developed  with  survivability  as  an  explicit  objective.  Such 
components  may  provide  both  essential  and  non-essential  services  and  may  require 
functional  requirements  for  isolation  and  control  through  wrappers  and  filters  to  permit 
their  safe  use  in  a  survivable  system  environment. 

Figure  3  shows  that  survivability  itself  imposes  new  types  of  requirements  on  systems. 
These new requirements include the resistance to, recognition of, and recovery from
intrusions  and  compromises,  and  adaptation  and  evolution  to  diminish  the  effectiveness 
of  future  intrusion  attempts.  These  survivability  requirements  are  supported  by  a  variety 
of existing and emerging survivability strategies, as noted in Figure 2 and discussed in
more  detail  below. 

Finally,  Figure  3  depicts  emergent  behavior  requirements  at  the  network  level.  These 
requirements are characterized as emergent because they are not associated with
particular  nodes,  but  rather  emerge  from  the  collective  behavior  of  node  services  in 
communicating  across  the  network.  These  requirements  deal  with  the  survivability  of 
overall  network  capabilities  (e.g.,  capabilities  to  route  messages  between  critical  sets  of 
nodes  regardless  of  how  intrusions  may  damage  or  compromise  network  topology). 
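
One way to make such a requirement testable is to check whether the critical nodes remain mutually reachable after a set of nodes has been lost. The sketch below assumes an undirected topology held in a dictionary; the four-node ring and the function names are illustrative only.

```python
from collections import deque

# Hypothetical topology: a ring A-B-C-D-A, stored as adjacency lists.
LINKS = {
    "A": ["B", "D"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "A"],
}


def reachable(links, start, removed):
    """Breadth-first search from start, skipping removed (compromised) nodes."""
    if start in removed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in links.get(node, ()):
            if nxt not in removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def critical_nodes_connected(links, critical, removed):
    """True if all critical nodes survive and can still reach one another."""
    critical, removed = set(critical), set(removed)
    if not critical or critical & removed:
        return False
    return critical <= reachable(links, next(iter(critical)), removed)
```

With this ring, losing node B leaves A and C connected through D, while losing both B and D separates them.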

We  envision  survivable  systems  that  are  capable  of  adapting  their  behavior,  function, 
and  resource  allocation  in  response  to  intrusions.  For  example,  when  necessary, 
functions  and  resources  devoted  to  non-essential  services  could  be  reallocated  to  the 
delivery  of  essential  services  and  to  intrusion  resistance,  recognition,  and  recovery. 
Requirements  for  such  systems  must  also  specify  how  the  system  should  adapt  and 
reconfigure  itself  in  response  to  intrusions. 
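
The reallocation described above can be sketched as a transfer of capacity from non-essential to essential services. The capacity units, the service names, and the simple donor order are assumptions made for illustration; a real system would also need rules for restoring non-essential services once the intrusion has been handled.

```python
def reallocate(allocations, essential, demand):
    """Move capacity from non-essential services to essential ones.

    allocations: dict mapping service name to capacity units currently held
    essential:   set of essential service names
    demand:      dict mapping essential service name to units it now needs
    Returns a new allocation; the total capacity is unchanged.
    """
    result = dict(allocations)
    donors = [name for name in result if name not in essential]
    for name in sorted(essential):
        shortfall = max(demand.get(name, 0) - result.get(name, 0), 0)
        for donor in donors:
            if shortfall == 0:
                break
            taken = min(result[donor], shortfall)
            result[donor] -= taken
            result[name] = result.get(name, 0) + taken
            shortfall -= taken
    return result
```

If the donors cannot cover the demand, the essential service simply receives everything that is available, which is one possible policy among several.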

Systems  can  exhibit  large  variations  in  survivability  requirements.  Small  local  networks 
may  require  few  or  no  essential  services  and  recovery  times  measured  in  hours. 
Conversely,  large-scale  networks  of  networks  may  require  a  core  set  of  essential 
services,  automated  intrusion  detection,  and  recovery  times  measured  in  minutes. 
Embedded  command  and  control  systems  may  require  essential  services  to  be 
maintained  in  real  time  and  recovery  times  measured  in  milliseconds. 

The  attainment  and  maintenance  of  survivability  consume  resources  in  system 
development,  operation,  and  evolution.  The  resources  allocated  to  a  system’s 
survivability  should  be  based  on  the  costs  and  risks  to  an  organization  associated  with 
the  loss  of  essential  services. 

Usage/Intrusion  Requirements:  Survivable-system  testing  must  demonstrate  the 
correct  performance  of  essential  and  non-essential  system  services  as  well  as  the 
survivability  of  essential  services  under  intrusion.  Because  system  performance  in 
testing  (and  operation)  depends  totally  on  the  system’s  use,  an  effective  approach  to 
survivable-system  testing  is  based  on  usage  scenarios  derived  from  usage  models 
[Mills  92,  Trammell  95]. 

Usage  models  are  developed  from  usage  requirements.  These  requirements  specify 
usage  environments  and  scenarios  of  system  use.  Usage  requirements  for  essential  and 
non-essential  services  must  be  defined  in  parallel  with  system  and  survivability 
requirements.  Furthermore,  intruders  and  legitimate  users  must  be  considered  equally. 
Intrusion  requirements  that  specify  intrusion-usage  environments  and  scenarios  of 
intrusion  use  must  be  defined  as  well.  In  this  approach,  intrusion  use  and  legitimate  use 
of  system  services  are  modeled  together. 
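
One common way to represent a usage model is as a Markov chain whose random walks serve as test scenarios. The sketch below assumes that representation; the states, the transition probabilities, and the intruder actions "probe" and "escalate" are hypothetical and only show how legitimate and intrusion use can share one model.

```python
import random

# Hypothetical usage model: each state maps to (next_state, probability)
# pairs. "probe" and "escalate" are intruder actions modeled alongside
# legitimate ones. States and probabilities are illustrative only.
USAGE_MODEL = {
    "start":    [("login", 0.9), ("probe", 0.1)],
    "login":    [("query", 0.7), ("logout", 0.3)],
    "query":    [("query", 0.5), ("logout", 0.4), ("escalate", 0.1)],
    "probe":    [("login", 0.5), ("end", 0.5)],
    "escalate": [("logout", 1.0)],
    "logout":   [("end", 1.0)],
}


def generate_scenario(model, rng, max_steps=50):
    """Random walk from "start" toward "end", returning one test scenario."""
    state, path = "start", ["start"]
    while state != "end" and len(path) < max_steps:
        targets, weights = zip(*model[state])
        state = rng.choices(targets, weights=weights)[0]
        path.append(state)
    return path
```

Calling generate_scenario(USAGE_MODEL, random.Random(seed)) repeatedly with different seeds yields a reproducible sample of scenarios, some of which include intruder steps.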



CMU/SEI-97-TR-013 


Figure  4  depicts  the  relationship  between  legitimate  and  intrusion  use.  Intruders  may 
engage  in  scenarios  beyond  legitimate  scenarios,  but  may  also  employ  legitimate  use  for 
purposes  of  intrusion  if  they  gain  the  necessary  privileges. 


Figure  4:  The  Relationship  Between  Legitimate  and  Intrusion  Usage 

Development  Requirements:  Survivability  places  stringent  requirements  on  system 
development  and  testing  practices.  Inadequate  functionality  and  software  errors  can 
have  a  devastating  effect  on  system  survivability  and  provide  opportunities  for  intruder 
exploitation.  Sound  engineering  practices  are  required  to  create  survivable  software. 

The  following  five  principles  (four  technical  and  one  organizational)  are  example 
requirements  for  survivable-system  development  and  testing  practices: 

•  Precisely  specify  the  system’s  required  functions  in  all  possible  circumstances  of 
system  use. 

•  Verify  the  correctness  of  system  implementations  with  respect  to  the  functional 
specifications. 

•  Specify  function  usage  in  all  possible  circumstances  of  system  use,  including  intruder 
use. 

•  Test  and  certify  the  system  based  on  function  usage  and  statistical  methods. 

•  Establish  permanent  readiness  teams  for  system  monitoring,  adaptation,  and 
evolution. 

Sound engineering practices are required to deal with legacy and COTS software
components  as  well. 

Operations  Requirements:  Survivability  places  demands  on  requirements  for  system 
operation  and  administration.  These  requirements  include  defining  and  communicating 
survivability  policies,  monitoring  system  use,  responding  to  intrusions,  and  evolving 
system  functions  as  needed  to  ensure  survivability  as  usage  environments  and  intrusion 
patterns  change  over  time. 

Evolution  Requirements:  System  evolution  responds  to  user  requirements  for  new 
functions.  However,  this  evolution  is  also  necessary  to  respond  to  increasing  intruder 
knowledge  of  system  behavior  and  structure.  In  particular,  survivability  requires  that 
system  capabilities  evolve  more  rapidly  than  intruder  knowledge.  This  rapid  evolution 
prevents  intruders  from  accumulating  information  about  otherwise  invariant  system 
behavior  that  they  need  to  achieve  successful  penetration  and  exploitation. 

2.1.1 Requirements Definition for Essential Services

The  preceding  discussion  distinguishes  between  essential  and  non-essential  services. 
Each  system  requirement  must  be  examined  to  determine  whether  it  corresponds  to  an 
essential  service.  The  set  of  essential  services  must  form  a  viable  subsystem  for  users 
that  is  complete  and  coherent.  If  multiple  levels  of  essential  services  are  required,  each 
set  of  services  provided  at  each  level  must  also  be  examined  for  completeness  and 
coherence.  In  addition,  requirements  must  be  defined  for  making  the  transition  to  and 
from  essential-service  levels. 

When distinguishing between essential and non-essential services, all of the usual
requirements-definition  processes  and  methods  can  be  applied.  Elicitation  techniques 
such  as  those  embodied  in  Software  Requirements  Engineering  can  help  to  identify 
essential  services  [Ebert  97].  Tradeoff  and  cost/benefit  analysis  can  help  to  determine 
the  sets  of  services  that  sufficiently  address  business  survivability  risks  and 
vulnerabilities.  Provisions  for  tracing  survivability  requirements  through  design,  code, 
and  test  must  be  established.  As  previously  mentioned,  simulation  of  intrusion  through 
intruder-usage  scenarios  is  included  in  the  testing  process. 

2.1.2 Requirements Definition for Survivability Services

After  specifying  requirements  for  essential  and  non-essential  services,  a  set  of 
requirements  for  survivability  services  must  be  defined.  These  services  can  be 
organized  into  four  general  categories:  resistance,  recognition,  recovery,  and  adaptation 
and  evolution.  These  survivability  services  must  operate  in  an  intruder  environment  that 
can  be  characterized  by  three  distinct  phases  of  intrusion:  penetration,  exploration,  and 
exploitation. 

Penetration Phase. In this phase, an intruder attempts to gain access to a system
through various attack scenarios. These scenarios range from random inputs by
hobbyist hackers to well-planned attacks by professional intruders. These attempts are
designed  to  capitalize  on  known  system  vulnerabilities. 

Exploration  Phase.  In  this  phase,  the  system  has  been  penetrated  and  the  intruder  is 
exploring  internal  system  organization  and  capabilities.  By  exploring,  the  intruder  learns 
how  to  exploit  the  access  to  achieve  intrusion  objectives. 

Exploitation  Phase.  In  this  phase,  the  intruder  has  gained  access  to  desired  system 
facilities  and  is  performing  operations  designed  to  compromise  system  capabilities. 

Penetration,  exploration,  and  exploitation  create  a  spiral  of  increasing  intruder  authority 
and  a  widening  circle  of  compromise.  For  example,  penetration  at  the  user  level  is 
typically  a  means  to  find  root-level  vulnerabilities.  User-level  authorization  is  then 
employed  to  exploit  those  vulnerabilities  to  achieve  root-level  penetration.  Finally, 
compromise  of  the  weakest  host  in  a  networked  system  allows  that  host  to  be  used  as  a 
stepping-stone  to  compromise  other  more  protected  hosts. 

Requirements  definitions  for  resistance,  recognition,  recovery,  and  adaptation  and 
evolution  services  help  select  survivability  strategies  to  deal  with  these  phases  of 
intrusion.  Some  strategies,  such  as  firewalls,  are  the  product  of  extensive  research  and 
development  and  currently  are  used  extensively  in  bounded  networks.  New  survivability 
strategies  are  emerging  to  respond  to  the  unique  challenges  of  unbounded  networks. 

Resistance  Service  Requirements.  Resistance  is  the  capability  of  a  system  to  deter 
attacks.  Resistance  is  thus  important  in  the  penetration  and  exploration  phases  of  an 
attack,  before  actual  exploitation.  Current  strategies  for  resistance  include  the  use  of 
firewalls,  authentication,  and  encryption.  Diversification  is  a  resistance  strategy  that  will 
likely  become  more  important  for  unbounded  networks. 

Requirements  for  diversification  must  define  planned  variation  in  survivable  system 
function,  structure,  and  organization,  and  the  means  for  achieving  it.  Diversification  is 
intended  to  create  a  moving  target  and  render  ineffective  the  accumulation  of  system 
knowledge  as  an  intrusion  strategy.  Diversification  also  eliminates  intrusion  opportunities 
associated  with  multiple  nodes  that  execute  identical  software  and  typically  exhibit 
identical  vulnerabilities.  Such  systems  offer  tempting  economies  of  scale  to  intruders, 
since  when  one  node  has  been  penetrated,  all  nodes  can  be  penetrated.  Requirements 
for  diversification  can  include  variation  in  programs,  retained  data,  and  network  routing 
and  communication.  For  example,  systematic  means  can  be  defined  to  randomize 
software  programs  while  preserving  functionality  [Linger  99]. 

Recognition Service Requirements. Recognition is the capability of a system to
recognize  attacks  or  the  probing  that  precedes  attacks.  The  ability  to  react  or  adapt 
during  an  intrusion  is  central  to  the  capacity  of  a  system  to  survive  an  attack  that  cannot 
be  completely  repelled.  To  react  or  adapt,  the  system  must  first  recognize  it  is  being 
attacked.  In  fact,  recognition  is  essential  in  all  three  phases  of  attack. 

Current  strategies  for  attack  recognition  include  both  state-of-the-art  intrusion  detection 
and  mundane  but  effective  techniques  such  as  logging  and  frequent  auditing  as  well  as 
follow-up  investigations  of  reports  generated  by  ordinary  error  detection  mechanisms. 
Advanced  intrusion-detection  techniques  are  generally  of  two  types:  anomaly  detection 
and  pattern  recognition.  Anomaly  detection  is  based  on  models  of  normal  user  behavior. 
These  models  are  often  established  through  statistical  analysis  of  usage  patterns. 
Deviations  from  normal  usage  patterns  are  flagged  as  suspicious.  Pattern  recognition  is 
based  upon  models  of  intruder  behavior.  User  activity  that  matches  a  known  pattern  of 
intruder  behavior  raises  an  alarm. 
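
The two detection styles can be contrasted in a short sketch. It assumes a single numeric usage measure per user (for example, logins per hour) for anomaly detection and a list of named event sequences for pattern recognition; the threshold, the event names, and the signature are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history, observed, threshold=3.0):
    """Anomaly detection: flag a value far from the user's normal usage.

    history must contain at least two past observations.
    """
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return observed != mu
    return abs(observed - mu) / sigma > threshold


# Hypothetical signatures: event sequences associated with intruder behavior.
SIGNATURES = {
    "password-guessing": ["login_failed", "login_failed", "login_failed"],
}


def matches_signature(events, signatures=SIGNATURES):
    """Pattern recognition: names of signatures found contiguously in events."""
    hits = []
    for name, pattern in signatures.items():
        n = len(pattern)
        if any(events[i:i + n] == pattern for i in range(len(events) - n + 1)):
            hits.append(name)
    return hits
```

The first function needs no knowledge of attacks but depends on a good model of normal use; the second recognizes only the patterns it has been given.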

Requirements  for  future  survivable  networks  will  likely  employ  additional  strategies  such 
as  self-awareness,  trust  maintenance,  and  black-box  reporting.  Self-awareness  is  the 
process  of  establishing  a  high-level  semantic  model  of  the  computations  that  a 
component  or  system  is  executing  or  has  been  asked  to  execute.  A  system  or 
component  that  understands  what  it  is  being  asked  can  refuse  requests  that  would  be 
dangerous,  compromise  a  security  policy,  or  adversely  impact  the  delivery  of  minimum 
essential  services. 

Trust  maintenance  is  achieved  by  a  system  through  periodic  queries  among  its 
components  (e.g.,  among  the  nodes  in  a  network)  to  continually  test  and  validate  trust 
relationships.  Detection  of  signs  of  intrusion  would  trigger  an  immediate  test  of  trust 
relationships. 
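
A periodic trust query can be sketched as a challenge-response exchange. The sketch assumes that each pair of nodes already shares a secret key, which the report does not specify, and it checks only that a peer still holds that key; the class and function names are hypothetical.

```python
import hashlib
import hmac
import os


class Node:
    """A node that answers trust challenges with a keyed digest."""

    def __init__(self, name, key):
        self.name = name
        self.key = key

    def respond(self, challenge):
        return hmac.new(self.key, challenge, hashlib.sha256).digest()


def validate_peer(peer, expected_key):
    """Send a fresh random challenge and check the peer's response."""
    challenge = os.urandom(16)
    expected = hmac.new(expected_key, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(peer.respond(challenge), expected)


def trust_sweep(peers, keys):
    """Return the names of peers that fail validation in this round."""
    return [p.name for p in peers if not validate_peer(p, keys[p.name])]
```

A sweep of this kind could run on a timer and also be triggered immediately when signs of intrusion are detected. An intruder who has captured a node's key would still pass this check, so it is only one ingredient of trust maintenance.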

Black-box  reporting  is  a  dump  of  system  information  that  can  be  retrieved  from  a 
crashed  system  or  component  for  analysis  to  determine  the  cause  of  the  crash  (e.g., 
design  error  or  specific  intrusion  type).  This  analysis  can  help  to  prevent  other 
components  from  suffering  the  same  fate. 

A  survivable-system  design  must  include  explicit  requirements  for  recognition  of  attack. 
These  requirements  ensure  the  use  of  one  or  more  of  the  preceding  strategies  through 
the  specification  of  architectural  features,  automated  tools,  and  manual  processes.  Since 
intruder  techniques  are  constantly  advancing,  recognition  requirements  should  be 
frequently  reviewed  and  continuously  improved. 

Recovery  Service  Requirements.  Recovery  is  a  system’s  ability  to  restore  services 
after  an  intrusion  has  occurred.  Recovery  also  contributes  to  a  system’s  ability  to 
maintain  essential  services  during  intrusion. 

Requirements for recoverability are what most clearly distinguish survivable systems
from  systems  that  are  merely  secure.  Traditional  computer  security  leads  to  the  design 
of  systems  that  rely  almost  entirely  on  hardening  (i.e.,  resistance)  for  protection.  Once 
security  is  breached,  damage  may  follow  with  little  to  stand  in  the  way.  The  ability  of  a 
system  to  react  during  an  active  intrusion  is  central  to  its  capacity  to  survive  an  attack 
that  cannot  be  completely  repelled.  Recovery  is  thus  crucial  during  the  exploration  and 
exploitation  phases  of  intrusion. 

Recovery  strategies  in  use  today  include  replication  of  critical  information  and  services, 
use  of  fault-tolerant  designs,  and  incorporation  of  backup  systems  for  hardware  and 
software.  These  backup  systems  maintain  master  copies  of  critical  software  in  isolation 
from  the  network.  Some  systems,  such  as  large-scale  transaction  processing  systems, 
employ  elaborate,  fine-grained  transaction  roll-back  processes  to  maintain  the 
consistency  and  integrity  of  state  data. 
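
The roll-back idea can be illustrated with a deliberately coarse sketch: take a snapshot of the state, apply a group of operations, and restore the snapshot if any operation fails. Production transaction systems use far finer-grained logging; the class name and the dictionary-based state here are assumptions for illustration.

```python
import copy


class RollbackStore:
    """Key-value state with all-or-nothing updates."""

    def __init__(self, state=None):
        self.state = dict(state or {})

    def apply(self, operations):
        """Apply each callable to the state; roll back if any of them raises."""
        snapshot = copy.deepcopy(self.state)
        try:
            for op in operations:
                op(self.state)
        except Exception:
            self.state = snapshot  # restore the last consistent state
            return False
        return True
```

If a later operation in a group fails, for example because an integrity check rejects it, the effects of the earlier operations in that group are discarded as well.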

Adaptation  and  Evolution  Service  Requirements.  Adaptation  and  evolution  are 
critical  to  maintaining  resistance  to  ever-increasing  intruder  knowledge  of  how  to  exploit 
otherwise  unchanging  system  functions.  Dynamic  adaptation  permanently  improves  a 
system’s  ability  to  resist,  recognize,  and  recover  from  intrusion  attempts.  For  example, 
an  adaptation  requirement  may  be  an  infrastructure  that  enables  the  system  to  inoculate 
itself  against  newly-discovered  security  vulnerabilities  by  automatically  distributing  and 
applying  security  fixes  to  all  network  elements.  Another  adaptation  requirement  may  be 
that  intrusion  detection  rule  sets  are  updated  regularly  in  response  to  reports  of  known 
intruder  activity  from  authoritative  sources  of  security  information,  such  as  the  CERT® 
Coordination  Center. 

Adaptation  requirements  ensure  that  such  capabilities  are  an  integral  part  of  a  system’s 
design.  As  in  the  cases  of  resistance,  recognition,  and  recovery  requirements,  the 
constant  evolution  of  intruder  techniques  requires  that  adaptation  requirements  be 
frequently  reviewed  and  continuously  improved. 

3. Survivability Design and Implementation Strategies

In  this  section  we  examine  strategies  that  support  the  survivability  of  critical  system 
functions  in  unbounded  networks.  Strategies  for  survivability  in  networked  systems 
depend  on  several  assumptions  and  constraints.  Although  they  may  seem  obvious, 
these  assumptions  and  constraints  must  be  made  explicit.  The  assumptions  differ 
radically from the implicit assumptions traditionally made for the uniprocessor, multiprocessor,
and bounded network systems on which most previous research and
development  has  been  based. 

For  unbounded  networks,  we  assume  that 

•  any  individual  node  of  the  network  can  be  compromised 

•  survivability  does  not  require  that  any  particular  physical  component  of  the  network 
be  preserved 

•  only  the  essential  services  of  the  network  as  a  whole  must  survive 

•  for  reasons  of  reliability,  design  error,  user  error,  and  intentional  compromise,  the 
trustworthiness  of  a  network  node  or  any  node  with  which  it  can  communicate  cannot 
be  guaranteed 

In  this  report,  we  primarily  discuss  unbounded  networks.  The  term  unbounded  has  a 
slightly  different  meaning  depending  on  the  purpose  and  situation  involved.  In  all  cases, 
unbounded  networks  relate  to  three  principal  characteristics  that  are  present  in  each 
definition:  a  lack  of  central  physical  or  administrative  control,  absence  of  insight  or  vision 
into  all  parts  of  the  network,  and  no  practical  limit  on  growth  in  the  number  of  nodes  in 
the  network. 

These  assumptions  impose  the  following  constraints  on  the  architecture  of  survivable 
networks  and  on  the  form  of  feasible  survivability  strategies: 

•  There must not be a single point of failure within the network. Essential services are
distributed  in  a  manner  that  is  not  critically  dependent  on  any  particular  component 
or  node. 

•  Global  knowledge  is  impossible  to  achieve  in  a  distributed  system  [Halpern  84]. 
There  are  no  all-seeing  global  oracles.  Instead,  protocols  define  the  interaction  and 
knowledge  shared  between  nodes. 

•  Each  node  must  continuously  validate  the  trustworthiness  of  itself  and  those  with 
which  it  communicates. 

•  Computations  within  a  given  node  of  an  unbounded  network,  whether  for  essential 
services,  communication,  or  trust  validation,  must  have  costs  that  are  less  than 
proportional  to  the  number  of  nodes  in  the  network. 
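
The last constraint can be illustrated with trust validation. Rather than checking every node, a node might validate a random sample of the peers it knows about, with a sample size that grows logarithmically in the estimated network size. The logarithmic choice and the function names are assumptions for this sketch, not a prescription from the report.

```python
import math
import random


def sample_size(estimated_nodes):
    """Number of peers to validate per round: grows as log2(n), not n."""
    return max(1, math.ceil(math.log2(max(estimated_nodes, 2))))


def peers_to_check(known_peers, estimated_nodes, rng):
    """Choose this round's validation targets from the node's partial view."""
    k = min(len(known_peers), sample_size(estimated_nodes))
    return rng.sample(known_peers, k)
```

Under this assumption, a network estimated at about a thousand nodes costs each node ten validations per round, and a network of about a million nodes costs twenty. The node draws only from the peers it already knows, consistent with the absence of global knowledge.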

3.1 Four Aspects of Survivability Solution Strategies


As  introduced  in  Section  2,  there  are  four  aspects  of  the  survivability  solution  which  can 
serve  as  a  basis  for  survivability  strategies.  These  four  aspects  are:  resistance, 
recognition,  recovery,  and  system  adaptation  and  evolution.  This  section  summarizes 
the  approaches  in  each  of  these  four  areas. 

There  are  many  techniques  for  dealing  with  these  four  aspects.  Any  or  all  of  the 
techniques  may  apply  to  survivable  systems.  We  do  not  list  all  of  these  techniques  but 
instead  categorize  them  within  the  broader  aspects.  Table  2  contains  the  four  aspects  of 
the  survivability  solution  and  representative  taxonomies  of  respective  strategies. 


Survivability Aspect      Taxonomies of Strategies

Resistance                •  traditional security, including encryption and covert channels
                          •  diversity and maximized differences in individual nodes
                          •  analytic redundancy and voting
                          •  specialization, division of labor, trust, and information
                          •  continuous validation of trust
                          •  exhibited stochastic properties and random behavior

Recognition               •  analytic redundancy and testing (including failures in
                             software, encryption, and trust)
                          •  intrusion monitoring and suspicious activities
                          •  system behavior and integrity monitoring

Recovery                  •  physical and information redundancy
                          •  non-local copies of information resources
                          •  preparation, readiness, contingency planning, and response teams

Adaptation and            •  general or specific changes to resist, recognize, or recover
Evolution                    from new vulnerabilities that are discovered
                          •  broadcast of warnings to other nodes
                          •  broadcast of adaptation and evolution strategies
                          •  deterrence through retaliation or punishment

Table  2:  A  Taxonomy  of  Strategies  Related  to  Survivability 
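
As an illustration of the voting strategy listed under resistance, the sketch below accepts a result only when a strict majority of redundant computations agree. It assumes the redundant results are directly comparable values; the function name is hypothetical.

```python
from collections import Counter


def majority_vote(responses):
    """Return the value reported by a strict majority of responses, else None."""
    if not responses:
        return None
    value, count = Counter(responses).most_common(1)[0]
    return value if count * 2 > len(responses) else None
```

With three redundant results, one corrupted value is outvoted by the other two; when no value reaches a majority, the caller is told that no trustworthy answer is available.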

3.2 Support of Strategies by the Computing Infrastructure


The  rapid  growth  of  the  Web  and  other  Internet-based  applications  has  encouraged  the 
growth  of  a  computing  infrastructure  to  support  distributed  applications.  While  the  initial 
Web  efforts  concentrated  on  information  publishing,  the  application  domain  has 
expanded  to  encompass  a  much  wider  spectrum  of  an  organization’s  computing  needs. 
The  technical  focus  of  this  growth  has  moved  from  tools  such  as  Web  browsers  or 
servers  to  the  development  of  a  set  of  Internet-compatible,  commercially  provided 
services.  Examples  of  these  services  are  file,  print,  transaction,  messaging,  directory, 
security,  and  object  services  such  as  CORBA  (Common  Object  Request  Broker 
Architecture)  and  DCOM  (Distributed  Component  Object  Model). 

The  commercially  available  distributed  infrastructures  are  in  the  early  phases  of  their 
development  and  do  not  yet  directly  support  system  survivability.  Recognition  is  not  a 
supported  service  and  recovery  is  indirectly  supported  by  a  transaction  server.  Typically, 
an  organization  adopts  such  an  infrastructure  to  lower  costs  by  using  a  common 
infrastructure  for  intranets,  extranets,  and  Internet  applications  and  to  simplify  application 
development  by  capturing  the  complexity  of  distributed  computing  in  the  infrastructure 
rather  than  in  each  application. 

Managing  user-profile  data  is  an  example  of  a  service  that  a  distributed  infrastructure 
can  assume.  One  general  requirement  of  system  survivability  is  to  provide  user 
authentication  and  manage  the  authority  given  to  that  user  for  data  and  systems  access. 
Authentication  can  be  implemented  using  passwords  and  authorizations  that  are 
validated  by  access-control  lists.  However,  in  many  existing  systems,  such  as  database 
applications,  access-control  lists  are  maintained  by  the  application. 

When  system  users,  data,  and  applications  are  geographically  distributed,  the 
maintenance  of  user-profile  data  in  an  application  is  difficult.  A  shared  directory  service, 
which  is  part  of  a  distributed  infrastructure,  can  provide  the  data  storage  capability  and  a 
protocol  such  as  LDAP  (Lightweight  Directory  Access  Protocol)  for  application  access 
and  replace  the  application-specific  access-control  mechanisms.  These  infrastructure 
security  services  can  provide  the  mechanisms  for  user  authentication  such  as  a  public 
key  interface,  mechanisms  to  describe  access  control,  and  the  means  to  define  a 
security  policy.  The  use  of  shared  services  for  user  authentication  and  authorization 
should  reduce  application  and  overall  system  complexity  as  well  as  provide  the  means  to 
define  an  organizational  security  policy. 
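
As a minimal sketch of this idea, the following Python fragment lets two applications consult one shared directory for authentication and authorization instead of keeping their own access-control lists. The class SharedDirectory, its methods, and the sample user are hypothetical names introduced only for illustration. The sketch uses an in-memory table rather than a real directory server or the LDAP protocol.

```python
import hashlib
import hmac
import os


class SharedDirectory:
    """In-memory stand-in for a shared directory service (hypothetical)."""

    def __init__(self):
        # user name -> (salt, password hash, {resource: set of allowed actions})
        self._entries = {}

    @staticmethod
    def _hash(password, salt):
        # Salted, iterated hash so that plain passwords are never stored.
        return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 10_000)

    def add_user(self, user, password, acl):
        salt = os.urandom(16)
        self._entries[user] = (salt, self._hash(password, salt), acl)

    def authenticate(self, user, password):
        """Return True only if the user exists and the password matches."""
        entry = self._entries.get(user)
        if entry is None:
            return False
        salt, stored, _acl = entry
        return hmac.compare_digest(stored, self._hash(password, salt))

    def authorize(self, user, resource, action):
        """Return True if the user's access-control list permits the action."""
        entry = self._entries.get(user)
        if entry is None:
            return False
        return action in entry[2].get(resource, set())


# Two applications consult the same directory instead of keeping their own lists.
directory = SharedDirectory()
directory.add_user("alice", "correct horse",
                   {"payroll": {"read"}, "orders": {"read", "write"}})


def payroll_app_can_read(user, password):
    return (directory.authenticate(user, password)
            and directory.authorize(user, "payroll", "read"))


def orders_app_can_write(user, password):
    return (directory.authenticate(user, password)
            and directory.authorize(user, "orders", "write"))
```

Because both applications delegate to the same object, a change to a user's entry takes effect for every application at once, which is the simplification the shared service is meant to provide.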

When this strategy is implemented, the system architecture is constrained by the
infrastructure-supplied services and the protocols supported. For example, a survivability
strategy may be to replace a primary service with an alternate implementation of that
service if the primary service has been compromised. At this stage of infrastructure
deployment, there is some interoperability among services provided by different
vendors. However, there is also significant integration of services that makes it
difficult or impossible to replace a service, such as a directory service, with one from a
different vendor.
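
The replacement strategy can be sketched as follows, assuming that all implementations of a service share one calling convention and that some recognition function can report which implementations are believed compromised. The names call_with_fallback, primary_lookup, and alternate_lookup are hypothetical and stand in for real vendor services.

```python
class ServiceUnavailable(Exception):
    """Raised when no trusted implementation could answer the request."""


def call_with_fallback(implementations, is_compromised, request):
    """Try each implementation in order and return the first successful answer.

    implementations: list of callables, primary first, alternates after it.
    is_compromised:  callable returning True for implementations to avoid.
    """
    for impl in implementations:
        if is_compromised(impl):
            continue  # never route requests to a service believed compromised
        try:
            return impl(request)
        except ConnectionError:
            continue  # unreachable service: fall through to the next alternate
    raise ServiceUnavailable(request)


# Hypothetical primary and alternate directory lookups.
def primary_lookup(name):
    return "primary:" + name


def alternate_lookup(name):
    return "alternate:" + name


compromised = set()
services = [primary_lookup, alternate_lookup]

before = call_with_fallback(services, compromised.__contains__, "alice")
compromised.add(primary_lookup)  # an intrusion is recognized on the primary
after = call_with_fallback(services, compromised.__contains__, "alice")
```

The sketch works only because both lookups accept the same request and return the same kind of answer. Tight vendor-specific integration removes exactly that property.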

Using  shared  directory  services  also  raises  general  survivability  issues.  A  widely  used 
infrastructure should develop a robust set of services. However, wide use also fosters
a large and knowledgeable intruder community and a wide dissemination of information
about system vulnerabilities and security solutions. A compromised or inaccessible
directory  can  affect  multiple  applications  and  multiple  sites. 

An  essential  part  of  providing  system  survivability  is  establishing  operational  and 
administrative  procedures  for  system  directories  so  that  system  administrators  can 
monitor  service  and  provide  recovery.  The  design  tradeoff  is  that  implementing 
monitoring  and  recovery  procedures  is  less  costly  using  shared  components  than  using 
an  application-specific  architecture.  Infrastructure  services  provide  generic  support  for 
replication  and  maintenance  of  consistency  across  distributed  sites.  However,  achieving 
overall  mission  survivability  requires  not  only  understanding  the  impact  of  compromised 
access  control  data  and  of  the  design  of  a  recovery  policy,  but  also  knowledge  of  the 
system’s  applications. 

Commercially  available  infrastructure  products  provide  general  services  that  are 
independent  of  application  domain.  Some  of  the  services  listed  in  Figure  3,  however, 
require  application-domain  knowledge.  For  example,  recognition  of  an  intrusion  or 
maintenance  of  trust  among  nodes  requires  knowledge  of  expected  behavior.  A  protocol 
can  ensure  that  information  is  delivered,  but  cannot  validate  the  appropriateness  of  the 
data. Simple recovery mechanisms can include transaction logs or file restorations, but
the use of transactions, rollback strategies, and more advanced techniques requires domain
expertise to identify consistent application states and the impact of compromised data.
Such recovery strategies have been used successfully in application-centered
products, such as relational database systems, that manage relatively homogeneous
data structures. Applying such techniques to general distributed-computing systems is
more difficult.
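
A simple log-based recovery mechanism of the kind mentioned above can be sketched as an undo log over a key-value store. The class LoggedStore and its methods are hypothetical. The sketch assumes that the application, not the infrastructure, decides which checkpoint represents a consistent state, which is where the domain expertise enters.

```python
_MISSING = object()  # marks "key did not exist before this write"


class LoggedStore:
    """Key-value store that records enough history to undo recent writes."""

    def __init__(self):
        self.data = {}
        self._undo = []  # list of (key, previous value or _MISSING)

    def put(self, key, value):
        # Remember the old value before overwriting it.
        self._undo.append((key, self.data.get(key, _MISSING)))
        self.data[key] = value

    def checkpoint(self):
        """Return a mark identifying the current state."""
        return len(self._undo)

    def rollback(self, mark):
        """Undo every write made after the given checkpoint, newest first."""
        while len(self._undo) > mark:
            key, old = self._undo.pop()
            if old is _MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = old


store = LoggedStore()
store.put("balance", 100)
known_good = store.checkpoint()  # the application judges this state consistent
store.put("balance", 0)          # writes later attributed to an intruder
store.put("backdoor", True)
store.rollback(known_good)
```

The mechanism itself is generic. Choosing known_good, and deciding whether legitimate work done after it must be replayed, requires knowledge of the application.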

3.3  Survivability  Design  Observations


We  can  draw  a  number  of  observations  about  the  questions  and  issues  that  must  be 
addressed  concerning  system  survivability  in  networked  systems. 

3.3.1  Survivability  Requires  Trust  Maintenance 

An  open  issue  is  how  to  determine  the  basis  of  trust  and  how  an  individual  node  of  a 
network  contributes  to  the  survivability  of  the  system’s  essential  services  when 

•  any  node  can  be  unreliable  or  rogue 

•  there  is  no  global  view  or  global  control 

•  nodes  cannot  completely  trust  themselves  or  their  neighbors 

Depending  on  the  application,  it  may  be  possible  through  architectural  design  or 
dynamic  action  within  the  system  to  increase  the  reliability,  visibility,  and  control  of 
components  or  the  trustworthiness  of  participants.  The  only  absolute  basis  for  trust 
maintenance,  however,  is  the  consistency  of  behavioral  feedback  from  interactions  with 
other  nodes  and  independent  verification  of  claimed  actions  from  nodes  not  directly 
involved  in  the  transactions. 
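
One way to picture trust maintenance from behavioral feedback is a per-neighbor score that rises when a node's claimed action is confirmed by an independent witness and falls when it is not. This is only an illustrative sketch. The class TrustTable, the moving-average update rule, and the numeric parameters are assumptions made for the example, not a method described in this report.

```python
class TrustTable:
    """Tracks a trust score in [0, 1] for each neighboring node."""

    def __init__(self, initial=0.5, rate=0.2):
        self.initial = initial  # score assumed for a node never seen before
        self.rate = rate        # how strongly one observation moves the score
        self.scores = {}

    def record(self, node, claim_confirmed):
        """Update a node's score after an independent check of its claim."""
        old = self.scores.get(node, self.initial)
        target = 1.0 if claim_confirmed else 0.0
        # Exponential moving average: recent behavior outweighs old behavior.
        self.scores[node] = old + self.rate * (target - old)

    def score(self, node):
        return self.scores.get(node, self.initial)

    def trusted(self, node, threshold=0.6):
        return self.score(node) >= threshold


table = TrustTable()
for _ in range(5):
    table.record("node-a", claim_confirmed=True)   # consistent behavior
for _ in range(3):
    table.record("node-b", claim_confirmed=False)  # claims not verified
```

In this sketch, trust is never granted permanently. A node that begins to behave inconsistently loses its standing after a few observations.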

A  closely  related  point  is  the  absence  of  global  view  and  control.  If  unreliable  and 
untrustworthy  components  are  found  to  be  present  in  a  system,  determining  whether  the 
critical  functions  have  been  compromised  may  be  extremely  difficult  without  global  view 
and control. The absence of global view and control (which is the general case) does
not preclude effective survivable-network architectures. In particular, it
should be possible for individual nodes to contribute to the survivability goals in general
and, at worst, not interfere with them.

Genetic  algorithms,  for  example,  achieve  their  effects  through  the  collective  action  of  the 
individual  participants.  These  participants,  however,  cannot  measure  overall 
effectiveness  or  determine  whether  their  contribution  is  positive.  This  example  suggests 
that  survivability  solutions  can  exist  among  emergent  algorithms  that  depend  on 
continuous  interaction  with  neighboring  nodes  but  do  not  require  feedback  for  indications 
of  progress  and  success  [Fisher  99]. 
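
To illustrate how a global result can emerge from purely local interaction, the sketch below uses pairwise gossip averaging. This is a different technique from genetic algorithms and is not taken from [Fisher 99]. The function gossip_average and the six-node ring are hypothetical. Each node exchanges values only with a neighbor, and no node ever observes the global state or measures progress.

```python
import random


def gossip_average(values, neighbors, rounds, seed=0):
    """Repeatedly average the values of a random node and one of its neighbors."""
    rng = random.Random(seed)
    values = dict(values)  # work on a copy
    nodes = sorted(values)
    for _ in range(rounds):
        a = rng.choice(nodes)
        b = rng.choice(neighbors[a])
        mean = (values[a] + values[b]) / 2.0
        values[a] = values[b] = mean  # a purely local exchange
    return values


# Six nodes on a ring; each node knows only its two neighbors.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
start = {i: float(i) for i in range(6)}  # the global mean is 2.5
final = gossip_average(start, ring, rounds=1000)
```

After enough exchanges every node holds a value close to the global mean, although no participant could have verified that the computation was converging.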

3.3.2  Survivability  Analysis  Is  Protocol-Based  Not  Topology-Based

Another  implication  for  networked  systems  is  that  the  important  aspects  of  their 
architecture  from  the  viewpoint  of  survivability  relate  to  the  conventions  and  rules  of 
interaction  between  neighboring  nodes  and  that  the  network  topology  is  largely 
irrelevant.  That  is,  network  architectures  must  be  specified,  compared,  and  measured  in 
terms  of  their  interactions  and  not  the  topology  of  their  interconnection. 

As  an  example  of  this  kind  of  analysis,  consider  the  general  issue  of  persistence  of  state 
data  for  a  protocol.  Should  a  protocol  maintain  state  information  to  improve  reliability  or 
to  perform  additional  consistency  checks?  What  level  of  checking  should  the 
infrastructure support? J. H. Saltzer and his colleagues examined FTP (File Transfer
Protocol) and compared approaches that check packets only at the source and
destination nodes (end-to-end) with protocols that check reliability on each hop of the
communications path [Saltzer 84]. The conclusion was that hop-by-hop checking
increased complexity and affected performance with little increase in overall reliability.
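
The end-to-end approach can be sketched as follows. The source computes a digest, intermediate hops forward data without checking it, and only the destination verifies the result and triggers a retry. The function names and the bit-flipping hop are hypothetical. The sketch also assumes that the digest itself reaches the destination intact.

```python
import hashlib


class TransferFailed(Exception):
    """Raised when no attempt passes the end-to-end check."""


def transfer_end_to_end(payload, hops, max_attempts=3):
    """Send payload through hops; return (data, attempts) once it verifies."""
    expected = hashlib.sha256(payload).digest()  # computed at the source
    for attempt in range(1, max_attempts + 1):
        data = payload
        for hop in hops:  # intermediate nodes do no checking
            data = hop(data)
        if hashlib.sha256(data).digest() == expected:  # checked at destination
            return data, attempt
    raise TransferFailed("no attempt passed the end-to-end check")


def reliable_hop(data):
    return data


def make_flaky_hop(corrupt_first_n):
    """Return a hop that flips one bit in the first N messages it forwards."""
    state = {"seen": 0}

    def hop(data):
        state["seen"] += 1
        if state["seen"] <= corrupt_first_n and data:
            return bytes([data[0] ^ 0x01]) + data[1:]
        return data

    return hop


path = [reliable_hop, make_flaky_hop(1), reliable_hop]
received, attempts = transfer_end_to_end(b"essential data", path)
```

The hops stay simple and stateless. All of the checking logic lives at the two endpoints, which is the tradeoff the comparison describes.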

Kenneth  P.  Birman  discusses  such  tradeoffs  in  a  more  general  context  [Birman  96]. 
Properties  such  as  reliability  and  survivability  can  be  enhanced  by  properties  that 
support fault tolerance or communication guarantees. However, supporting a strong
property such as uniform ordering of events can be thousands of times more costly than
supporting a weaker property that may require the application to handle non-uniform behavior.

Similar arguments can be made when you compare stateless architectures and
non-replicated data to maintaining a strong application-level consistency requirement. In the
case  of  stateless  architectures  and  non-replicated  data,  the  server  can  be  restarted  and 
the  clients  have  the  responsibility  to  reconnect.  Survivability  requires  tradeoff  analysis 
between  the  responsibilities  of  the  servers  and  the  clients  and  between  end-to-end 
protocol  monitoring  by  the  application  and  general  protocol  monitoring  provided  by  the 
infrastructure. For such a recovery strategy, the application level may be the appropriate
level at which to analyze application state and user behavior and to select appropriate
recovery actions.

3.3.3  Survivability  Is  Emergent  and  Stochastic

Survivability  goals  are  emergent  properties  that  are  desired  for  the  system  as  a  whole, 
but  do  not  necessarily  prevail  for  individual  nodes  of  the  system.  This  approach 
contrasts  with  traditional  system  designs  in  which  specialized  functions  or  properties  are 
assured  for  particular  nodes  and  the  composition  of  the  system  must  ensure  that  those 
properties  and  functional  capabilities  are  preserved  for  the  system  as  a  whole.  For 
survivability,  we  must  achieve  system-wide  properties  that  typically  do  not  exist  in 
individual nodes. A survivable system must ensure that the desired survivability properties
emerge from the interactions among its components, much as reliable
systems are constructed from unreliable components.

Survivability  is  inherently  stochastic.  If  survivability  properties  are  emergent,  they  are 
present  only  when  the  number  of  contributing  component  nodes  of  a  system  is 
sufficiently  large.  If  the  number  or  arrangement  of  nodes  falls  below  a  critical  threshold, 
the  attendant  survivability  property  fails.  An  example  of  this  type  of  critical  survivability 
property  is  connectivity  in  a  communications  system. 

You  can  design  the  architecture  of  the  system  to  maximize  the  number  of  paths  between 
any  two  nodes;  but  if  enough  links  are  compromised  to  partition  the  network, 
communication between arbitrary nodes will no longer succeed. Thus, survivability
properties, algorithms, and architectures should be specified, viewed, and assessed in
terms of the probability of their success under given conditions of use, not
as discrete quantities.
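
As a sketch of what assessing a property as a probability can look like, the following fragment estimates by simulation how often a network stays connected when each link fails independently with a given probability. The function names, the independence assumption, and the two four-node topologies are hypothetical choices made for the example.

```python
import random


def is_connected(nodes, links):
    """Return True if every node can reach every other node over the links."""
    nodes = list(nodes)
    if not nodes:
        return True
    adjacent = {n: set() for n in nodes}
    for a, b in links:
        adjacent[a].add(b)
        adjacent[b].add(a)
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:  # depth-first search from the first node
        current = stack.pop()
        for nxt in adjacent[current]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(nodes)


def survival_probability(nodes, links, link_failure_prob, trials=2000, seed=0):
    """Estimate the probability that the network remains connected."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        # Each link independently survives with probability 1 - failure prob.
        alive = [link for link in links if rng.random() >= link_failure_prob]
        if is_connected(nodes, alive):
            survived += 1
    return survived / trials


nodes = [0, 1, 2, 3]
line = [(0, 1), (1, 2), (2, 3)]  # a single path between the end nodes
ring = line + [(3, 0)]           # one extra link adds a second path

p_line = survival_probability(nodes, line, link_failure_prob=0.2)
p_ring = survival_probability(nodes, ring, link_failure_prob=0.2)
```

Under these assumptions the exact values are 0.8^3 = 0.512 for the line and 0.8^4 + 4(0.2)(0.8^3) = 0.8192 for the ring, so the estimates should fall near those figures. Neither architecture is simply connected or not connected. Each has a probability of remaining connected under the stated conditions.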

3.3.4  Survivability  Requires  a  Management  Component 

The  design  of  a  survivable  system  also  includes  management  operations  and 
administration.  Poor  system  administration  is  a  frequent  source  of  vulnerabilities  at 
centrally  administered  sites.  In  unbounded  network  systems,  system  administration  must 
be  coordinated  across  multiple  sites.  Existing  system  administration  procedures  typically 
assume  a  bounded  environment  and  full  administrative  control  over  the  required 
services.  The  complexity  of  infrastructure  and  the  use  of  services  outside  an 
organization’s  immediate  control  require  expanding  the  administrative  services  and 
providing  a  monitoring  function  as  part  of  the  infrastructure. 


CMU/SEI-97-TR-013 




4.  Research  Directions 


There are a number of promising research areas in survivable systems. The plans for
the Survivable Network Technology team at the SEI include

•  adapting  and  developing  architectural  description  techniques  to  adequately  describe 
large-scale  distributed  systems  with  survivability  attributes 

•  representing  intruder  environments  through  intruder  usage  models 

•  creating  an  analysis  method  to  evaluate  survivability  as  a  global  emergent  property 
from  architectural  specification 

•  refining  the  analysis  technology  and  instruments  through  pilot  tests  of  real  distributed 
systems 

Glossary


Adaptation and Evolution Services
    Survivable system functions provided to continually improve the
    system's capability to deliver essential services, typically by
    improving resistance, recognition, and recovery capabilities

Essential Services
    Services to users of a system that must be provided even in the
    presence of intrusion, failure, or accident

Intrusion
    An attack on a network for purposes of gaining access to or
    destroying privileged information, or disrupting services to legitimate
    users

Network Architecture
    A definition of the high-level behavior of and connections among
    nodes in a network, sufficient to evaluate network properties

Non-Essential Services
    Services to users of a system that can be temporarily suspended to
    permit delivery of essential services while the system is dealing with
    intrusions and compromises

Recognition Services
    Survivable system functions that detect attempted and successful
    intrusions

Recovery Services
    Survivable system functions that restore full services after an
    intrusion has occurred

Resistance Services
    Survivable system properties and functions that make intrusion
    difficult and costly

Survivability
    The capability of a system to fulfill its mission, in a timely manner, in
    the presence of attacks, failures, or accidents

Survivability Requirements
    The definition of essential services as well as resistance, recognition,
    recovery, and adaptation and evolution functions that are sufficient to
    achieve required levels of a system's survivability

System Requirements
    The definition of user requirements for system services and usage,
    for which survivability requirements can be defined

Unbounded Network
    A network characterized by topology and functionality that cannot be
    determined, and by the absence of centralized administrative control

Usage Model
    A definition of all possible usage scenarios of a system, including
    legitimate and intruder use

Usage Scenario
    An instance of system use, either legitimate or intruder use